The lecture material for today mentioned work by Freelon (2018), Bruns (2019), Puschmann (2019), and Lazer et al. (2020) as well as a report by SOMA outlining solutions for research data exchange.
These example tasks use different sources of online data, and here I introduce you to how we might gather data through both screen-scraping (or server-side) techniques as well as API (or client-side) techniques.
In this tutorial, you will learn how to:
- Obtain authorization for the Twitter Academic Research Product Track
- Query the Twitter v2 API with the academictwitteR package
- Store, bind, and read in the tweet- and user-level data you collect
In order to use the Twitter Academic Research Product Track you will first need to obtain an authorization token. You will find details about the process of obtaining authorization here.
In order to gain authorization you first need a Twitter account.
First, Twitter will ask for details about your academic profile. Per the documentation linked above, they will ask for the following:
Your full name as it appears on your institution’s documentation
Links to webpages that help establish your identity; provide one or more of the following:
- A link to your profile in your institution’s faculty or student directory
- A link to your Google Scholar profile
- A link to your research group, lab or departmental website where you are listed
Information about your academic institution: its name, country, state, and city
Your department, school, or lab name
Your academic field of study or discipline at this institution
Your current role as an academic (whether you are a graduate student, doctoral candidate, post-doc, professor, research scientist, or other faculty member)
Twitter will then ask for details of the proposed research project. Here, questions include:
What is the name of your research project?
Does this project receive funding from outside your academic institution? If yes, please list all your sources of funding.
In English, describe your research project. Minimum 200 characters.
In English, describe how Twitter data via the Twitter API will be used in your research project. Minimum 200 characters.
In English, describe your methodology for analyzing Twitter data, Tweets, and/or Twitter users. Minimum 200 characters.
Will your research present Twitter data individually or in aggregate?
In English, describe how you will share the outcomes of your research (include tools, data, and/or other resources you hope to build and share). Minimum 200 characters.
Will your analysis make Twitter content or derived information available to a government entity?
Once you have gained authorization for your project you will be able to see the new project on your Twitter developer portal. First click on the developer portal as below.
Here you will see your new project, and the name you gave it, appear on the left-hand side. Once you have associated an App with this project, it will also appear below the name of the project. Here, I have several Apps authorized to query the basic API. I have one App, named “gencap,” that is associated with my Academic Research Product Track project.
When you click on the project, you will first see how much of your monthly cap of 10 million Tweets you have used. Below the monthly Tweet cap usage information, you will also see the App associated with your project.
By clicking on the Settings icon for the App, you will be taken through to the information about the App associated with the project. Here, you will see two options listed: “Settings” and “Keys and Tokens.”
Beside the panel for the Bearer Token, you will see an option to regenerate the token. You can do this if you have not stored the token and no longer have access to it. It is worth storing your Bearer Token somewhere safe so that you do not have to keep regenerating it.
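One way to follow this advice is to keep the token out of your scripts altogether by placing it in an environment variable. Below is a minimal sketch using base R; the variable name TWITTER_BEARER is an arbitrary choice for illustration, not something academictwitteR requires.

```r
# Store the token once per session (or, better, add the line
#   TWITTER_BEARER="AAAA..." to your .Renviron file so it persists):
Sys.setenv(TWITTER_BEARER = "AAAAAAAAAAAAAAAAAAAAA_INSERT_YOUR_TOKEN_HERE")

# Retrieve it wherever you need it, without hard-coding the secret:
bearer_token <- Sys.getenv("TWITTER_BEARER")
```

This way your scripts can be shared or committed to version control without exposing your credentials.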
Once you have the Bearer Token, you are ready to use academictwitteR!
Before proceeding, we’ll load the remaining packages we will need for this tutorial.
library(tidyverse) # loads dplyr, ggplot2, and others
library(academictwitteR) # to query the Academic Research Product Track Twitter v2 API endpoint in R

The Academic Research Product Track permits the user to access larger volumes of data, over a far longer time range, than was previously possible. From the Twitter documentation:
“The Academic Research product track includes full-archive search, as well as increased access and other v2 endpoints and functionality designed to get more precise and complete data for analyzing the public conversation, at no cost for qualifying researchers. Since the Academic Research track includes specialized, greater levels of access, it is reserved solely for non-commercial use.”
The new “v2 endpoints” refer to the v2 API, introduced around the same time as the new Academic Research Product Track. Full details of the v2 endpoints are available here.
In summary, the Academic Research product track allows the authorized user:
- access to the full archive of tweets posted on Twitter (full-archive search);
- a far higher monthly Tweet cap (10 million Tweets per month);
- access to the v2 endpoints, at no cost for qualifying researchers.
academictwitteR

We begin by storing our access token with:
bearer_token <- "AAAAAAAAAAAAAAAAAAAAA_INSERT_YOUR_TOKEN_HERE"

The workhorse function of academictwitteR when it comes to collecting tweets containing a particular string or hashtag is get_all_tweets().
tweets <-
get_all_tweets(
"#BLM OR #BlackLivesMatter",
"2020-01-01T00:00:00Z",
"2020-01-05T00:00:00Z",
bearer_token,
file = "blmtweets"
  )

Here, we are collecting tweets containing one or both of two hashtags related to the Black Lives Matter movement over the period January 1, 2020 to January 5, 2020.
academictwitteR

Given the sizeable increase in the volume of data potentially retrievable with the Academic Research Product Track, it is advisable that researchers establish clear storage conventions to mitigate data loss caused by, for example, the unplanned interruption of an API query.
We first draw your attention to the file argument in the code for the API query above.
In the file path, the user can specify the name of a file to be stored with a “.rds” extension, which includes all of the tweet-level information collected for a given query.
Alternatively, the user can specify a data_path as follows:
tweets <-
get_all_tweets(
"#BLM OR #BlackLivesMatter",
"2020-01-01T00:00:00Z",
"2020-01-05T00:00:00Z",
bearer_token,
    data_path = "data/",
    bind_tweets = FALSE
  )

In the data path, the user can either specify a directory that already exists or name a new directory. In other words, if there is already a folder in your working directory called “data”, then get_all_tweets() will find it and store data there. If there is no such directory, then a directory named (here) “data” will be created in your working directory for the purposes of data storage.
The data is stored in this folder as a series of JSONs. Tweet-level data is stored as a series of JSONs beginning “data_”; User-level data is stored as a series of JSONs beginning “users_.”
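If you want to inspect one of these stored files directly, you can read a single JSON into R yourself. A minimal sketch, assuming the jsonlite package is installed (it is not loaded elsewhere in this tutorial) and using a placeholder file name that follows the “data_” pattern:

```r
library(jsonlite) # an additional dependency, assumed for this sketch

# Read one tweet-level JSON into a data.frame. The file name below is a
# placeholder following the "data_" naming pattern; substitute one of the
# files actually present in your data_path.
tweet_batch <- fromJSON("data/data_0000000000000000000.json")

# Inspect the top-level structure of the parsed batch
str(tweet_batch, max.level = 1)
```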
Note that the get_all_tweets() function always returns a data.frame object unless data_path is specified and bind_tweets is set to FALSE. When collecting large amounts of data, we recommend using the data_path option with bind_tweets = FALSE. This mitigates potential data loss in case the query is interrupted, and avoids system memory usage errors.
Users can then use the bind_tweet_jsons() and bind_user_jsons() convenience functions to bundle the JSONs into a data.frame object for analysis in R:
tweets <- bind_tweet_jsons(data_path = "data/")
users <- bind_user_jsons(data_path = "data/")

Let’s say, as an example, we queried the Twitter API with the following code:
get_all_tweets(
"#BLM OR #BlackLivesMatter",
"2020-01-01T00:00:00Z",
"2020-01-05T00:00:00Z",
bearer_token,
data_path = "data/academictwitteR_data",
file = "data/blmtweets",
  bind_tweets = FALSE
)

We can then look at the output in our directory of JSON files like this:
list.files("data/academictwitteR_data")

## [1] "data_1212161860600041475.json" "data_1212202218138374144.json"
## [3] "data_1212454140183547909.json" "data_1212541396005064704.json"
## [5] "data_1212602429998583808.json" "data_1212745962499780610.json"
## [7] "data_1212848819668340737.json" "data_1212931998357966848.json"
## [9] "data_1213102352925970433.json" "data_1213244966530502656.json"
## [11] "data_1213425494068285442.json" "query"
## [13] "users_1212161860600041475.json" "users_1212202218138374144.json"
## [15] "users_1212454140183547909.json" "users_1212541396005064704.json"
## [17] "users_1212602429998583808.json" "users_1212745962499780610.json"
## [19] "users_1212848819668340737.json" "users_1212931998357966848.json"
## [21] "users_1213102352925970433.json" "users_1213244966530502656.json"
## [23] "users_1213425494068285442.json"
We can then bind these files as follows:
blmtweets <- bind_tweet_jsons("data/academictwitteR_data")

Or we can simply read in the data already stored in serialized format as a .rds file:
blmtweets <- readRDS("data/blmtweets.rds")

And we’ll end up with something like this: